
[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK #3150

Merged
phu0ngng merged 5 commits into NVIDIA:main from phu0ngng:update_nccl on Jun 30, 2026

@phu0ngng

Collaborator

Description

Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Contributor

Greptile Summary

This PR bumps the 3rdparty/nccl submodule to pick up a downstream fix for MAX_SUPPORTED_TOKENS_PER_RANK, and updates four EP test/bench launcher scripts to use a consistent NVLink detection method.

  • Submodule bump (808d2433 → a6b5de08): pulls in the NCCL fix for MAX_SUPPORTED_TOKENS_PER_RANK that affects Expert Parallelism collectives.
  • NVLink detection standardized: all four EP scripts now use nvidia-smi nvlink --status 2>/dev/null | grep -qE 'Link [0-9]+:.*GB/s' instead of the topology-matrix check (nvidia-smi topo -m), confirming links are active (showing real bandwidth) rather than merely present in the topology table.
  • Three scripts (cpp, jax test, jax bench) gain the NVLink guard for the first time; the PyTorch script replaces the old topology-based guard with the new one.
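
To illustrate the detection pattern quoted above, here is a minimal sketch that applies the same `grep -qE` expression to canned text instead of a live `nvidia-smi` call. The helper name `nvlink_status_is_active` and both sample outputs are assumptions made for illustration; they are not taken from the PR's scripts or captured from real hardware.

```shell
# Hypothetical helper: succeeds when the text on stdin contains at least one
# "Link N: ... GB/s" line, i.e. a link that reports a bandwidth figure.
nvlink_status_is_active() {
    grep -qE 'Link [0-9]+:.*GB/s'
}

# Illustrative samples (assumed format, not real captures).
active_sample=$(printf 'GPU 0: Example GPU\n\t Link 0: 26.562 GB/s')
inactive_sample=$(printf 'GPU 0: Example GPU\n\t Link 0: <inactive>')

if printf '%s\n' "$active_sample" | nvlink_status_is_active; then
    echo "active sample: NVLink active"
fi

if ! printf '%s\n' "$inactive_sample" | nvlink_status_is_active; then
    echo "inactive sample: would skip"
fi
```

Empty input also fails the check, which is what the guard sees when `nvidia-smi` is missing or errors out and its stderr is discarded with `2>/dev/null`.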

Confidence Score: 5/5

Safe to merge — the submodule bump is a targeted bug fix and the shell script changes only tighten the existing skip guards.

The only code change is a submodule pointer update and consistent NVLink detection guards across four launcher scripts. The new detection method (checking for active link bandwidth via nvidia-smi nvlink --status) is strictly more precise than the old topology-matrix check, and the worst failure mode is an over-skip on an unusual hardware configuration rather than a hang or data corruption. No TE library code is modified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| 3rdparty/nccl | Submodule pointer updated to a6b5de08 to include the MAX_SUPPORTED_TOKENS_PER_RANK fix; no TE-side code changes required. |
| tests/pytorch/distributed/run_test_ep.sh | Replaced nvidia-smi topo -m topology check with nvidia-smi nvlink --status active-link check; logic and skip semantics are preserved. |
| tests/cpp_distributed/run_test_ep.sh | Adds NVLink active-link guard (new for this script) using the standardized nvidia-smi nvlink --status pattern. |
| tests/jax/multi_process_launch_ep.sh | Adds NVLink active-link guard (new for this script) placed correctly after the GPU-count check. |
| examples/jax/ep/bench/run_ep_bench.sh | Adds NVLink active-link guard (new for this script) placed correctly after the GPU-count check. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[EP test/bench script starts] --> B{GPU count >= 4?}
    B -- No --> C[SKIP: not enough GPUs]
    B -- Yes --> D{nvidia-smi nvlink --status\nmatches 'Link N:.*GB/s'?}
    D -- No --> E[SKIP: NVLink not active\nPCIe-only or unsupported]
    D -- Yes --> F[Run EP test / bench]
    F --> G{Exit code == 0?}
    G -- Yes --> H[PASS]
    G -- No --> I[FAIL]
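
The ordering of the two skip checks in the flowchart can be sketched as a small shell function. This is an illustration under stated assumptions, not the code from the PR: the function name `ep_guard`, its two arguments, and the `RUN` marker are hypothetical, while the threshold of 4 GPUs and the grep pattern come from the summary above.

```shell
# Hypothetical sketch of the flowchart's skip logic.
# $1: number of visible GPUs
# $2: text as printed by `nvidia-smi nvlink --status`
ep_guard() {
    gpu_count=$1
    nvlink_status=$2

    # First gate: the GPU-count check comes before the NVLink check.
    if [ "$gpu_count" -lt 4 ]; then
        echo "SKIP: not enough GPUs"
        return 0
    fi

    # Second gate: at least one link must report a bandwidth figure.
    if ! printf '%s\n' "$nvlink_status" | grep -qE 'Link [0-9]+:.*GB/s'; then
        echo "SKIP: NVLink not active"
        return 0
    fi

    echo "RUN"
}

# Canned inputs so the sketch runs without a GPU.
ep_guard 2 'Link 0: 26.562 GB/s'
ep_guard 8 'Link 0: <inactive>'
ep_guard 8 'Link 0: 26.562 GB/s'

# On a real node the inputs might be gathered like this (an assumption,
# not taken from the PR's scripts):
#   gpu_count=$(nvidia-smi -L 2>/dev/null | wc -l)
#   nvlink_status=$(nvidia-smi nvlink --status 2>/dev/null)
```

Both skip branches return 0 in this sketch so that a skipped run is not reported as a failure; whether the actual scripts use the same exit code is not stated in the summary.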

Reviews (6): Last reviewed commit: "Detect active NVLink via nvlink --status..."

Comment thread 3rdparty/nccl Outdated
@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng

Collaborator Author

/te-ci L1

@jberchtold-nvidia jberchtold-nvidia left a comment

Collaborator

LGTM, thanks!

@phu0ngng

Collaborator Author

Pipeline #56213025

phu0ngng added 5 commits June 29, 2026 11:52
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng phu0ngng merged commit 90baf02 into NVIDIA:main Jun 30, 2026
46 of 54 checks passed
@phu0ngng phu0ngng deleted the update_nccl branch June 30, 2026 07:37
KshitijLakhani pushed a commit that referenced this pull request Jun 30, 2026
…NS_PER_RANK (#3150)

* nccl with relax num_dispatch_tokens%64!=0

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Skip EP tests/examples on nodes without NVLink

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
2 participants